{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **演示0301：基本统计量**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **案例1：百分位数**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**实现百分位数计算**  \n",
    ">* 一组$n$个元素的数组升序排列后，处于$x\\%$位置的值称为第$x$百分位数\n",
    ">* 计算方法：\n",
    ">  * 数字升序排列\n",
    ">  * 计算下标索引：$i=(n-1)*x\\%$  考虑到下标索引从0开始，因此用$n-1$来计算\n",
    ">  * 如果$i$是整数，则直接返回$i$对应的元素值\n",
    ">  * 如果$i$不是整数，则将$i$向上向下分别取整得到两个相邻的下标索引$floor\\_i$和$ceil\\_i$，然后根据$i$与这两个索引之间的距离进行插值计算，返回相应结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.25\n",
      "5.5\n",
      "7.75\n"
     ]
    }
   ],
   "source": [
    "''' 自定义数组的百分位数计算 '''\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "arr = np.arange(1, 11, 1)    # [1,2,3,4,5,6,7,8,9,10]\n",
    "\n",
    "def my_percentile(array, percentage):\n",
    "    sorted_array = sorted(array)\n",
    "    p_index = percentage / 100 * (len(sorted_array) - 1)\n",
    "    floor_p_index = np.floor(p_index)\n",
    "    ceil_p_index = np.ceil(p_index)\n",
    "    \n",
    "    if floor_p_index == ceil_p_index: # p_index是整数\n",
    "        return sorted_array[int(p_index)]\n",
    "    else:\n",
    "        v0 = sorted_array[int(floor_p_index)] * (ceil_p_index - p_index)\n",
    "        v1 = sorted_array[int(ceil_p_index)] * (p_index - floor_p_index)\n",
    "        return v0 + v1\n",
    "\n",
    "print(my_percentile(arr, 25))\n",
    "print(my_percentile(arr, 50))\n",
    "print(my_percentile(arr, 75))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**中位数和四分位数计算**  \n",
    ">* 中位数实际上是$50\\%$分位数，可以通过*np.median*来计算\n",
    ">* 统计中经常用到的是$25\\%$、$50\\%$、$75\\%$分位数，分别称为第1、第2和第3四分位数，记为：$Q1,Q2,Q3$\n",
    ">* $Q1$和$Q3$之间距离的一半，称为**四分位差**，记为$Q$。$Q$越小，说明中间部分的数据越集中；反之则越分散\n",
    ">* *np.percentile*可直接用于计算百分位数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.5\n",
      "3.25\n",
      "7.75\n"
     ]
    }
   ],
   "source": [
    "''' 调用库函数计算四分位数 '''\n",
    "\n",
    "arr = np.arange(1, 11, 1)    # [1,2,3,4,5,6,7,8,9,10]\n",
    "print(np.median(arr))\n",
    "print(np.percentile(arr, 25))    # 25%百分位数\n",
    "print(np.percentile(arr, 75))    # 75%百分位数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **案例2：偏差相关的计算**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**数组偏差的概念及计算**  \n",
    ">* 偏差(*deviation*)：数组中每个元素与数组平均值的差，其结果仍然是数组\n",
    ">* 计算公式：$ dev(x)=x^{(i)}-\\bar{x}, i=1,2,\\cdots,n $ \n",
    ">  * $x^{(i)}$表示第$i$个元素\n",
    ">  * $\\bar{x}$表示数组的算数平均值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5]\n"
     ]
    }
   ],
   "source": [
    "''' 计算数组的偏差 '''\n",
    "\n",
    "arr = np.arange(1, 11, 1)\n",
    "\n",
    "def my_deviation(array):\n",
    "    avg = np.mean(array)\n",
    "    return [el - avg for el in array]\n",
    "\n",
    "print(my_deviation(arr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**方差的概念及计算**  \n",
    ">* 方差(*variance*)：每个元素的偏差平方和，再除以有效元素个数\n",
    ">* 计算式：$ var(x) = \\dfrac{\\sum_{i=1}^{n} (x^{(i)} - \\bar{x} )^2}{n} $ 其中，$n$为有效元素个数\n",
    ">  * 无偏估计方差(*unbiased estimator*)：有效元素个数为【数组长度-1】\n",
    ">  * 有偏估计方差(*biased estimator*)：有效元素个数为【数组长度】\n",
    ">* 用于计算方差的函数：*np.var(arr)*\n",
    ">  * *ddof=1*，计算无偏方差\n",
    ">  * *ddof=0*，计算有偏方差(默认)\n",
    ">* 方差的意义\n",
    ">  * 方差指示了一个数组中各个元素的离散程度\n",
    ">  * 方差越大，离散程度越大；方差越小，说明各个数据都比较集中在均值附近"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.166666666666666\n",
      "9.166666666666666\n",
      "8.25\n",
      "8.25\n"
     ]
    }
   ],
   "source": [
    "''' 自定义有偏和无偏方差计算函数 '''\n",
    "\n",
    "def my_unbiased_variance(array):\n",
    "    deviations = my_deviation(array)\n",
    "    return sum([el**2 for el in deviations]) / (len(array) - 1)\n",
    "\n",
    "def my_biased_variance(array):\n",
    "    deviations = my_deviation(array)\n",
    "    return sum([el**2 for el in deviations]) /  len(array)\n",
    "\n",
    "print(my_unbiased_variance(arr))   \n",
    "print(np.var(arr, ddof=1))      # ddof=1表示计算无偏方差\n",
    "print(my_biased_variance(arr))     \n",
    "print(np.var(arr, ddof=0))      # ddof=0表示计算有偏方差"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**标准差的概念及计算**  \n",
    ">* 标准差(*standard variance*)：又叫**均方差**，也就是方差的平方根\n",
    ">* 分为有偏估计和无偏估计标准差\n",
    ">* 计算式：$ std(x)=\\sqrt{var(x)} $\n",
    ">* 用于计算标准差的函数：*np.std(arr)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.0276503540974917\n",
      "2.8722813232690143\n"
     ]
    }
   ],
   "source": [
    "''' 调用库函数计算标准差 '''\n",
    "\n",
    "print(np.std(arr, ddof=1))     # 无偏标准差\n",
    "print(np.std(arr, ddof=0))     # 有偏标准差"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**协方差的概念及计算**  \n",
    ">* 协方差(*covariance*)：对于长度相同的两个数组，先求出各自的偏差，然后求出两组偏差的内积和，最后除以有效元素总数\n",
    ">* 计算式：$ cov(x, y)=\\dfrac{\\sum_i^n(\\bar{x} - x^{(i)})(\\bar{y} - y^{(i)})}{n} $\n",
    ">  * $\\bar{y}$：数组$y$的平均值\n",
    ">  * $\\bar{x}$：数组$x$的平均值\n",
    ">  * $x^{(i)}$：数组$x$中第$i$个元素\n",
    ">  * $y^{(i)}$：数组$y$中第$i$个元素\n",
    ">* 协方差的意义\n",
    ">  * 两个数组(或向量)的协方差越大，说明二者之间的相互线性影响越明显\n",
    ">* 用于计算协方差的函数：*np.cov(A)*\n",
    ">  * $A$代表一个$(M,N)$的矩阵或二维数组\n",
    ">  * 返回协方差矩阵$C$，其中 $C[i,j]$表示一维数组$A[i]$与$A[j]$的协方差"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.0\n",
      "5.0\n",
      "2.5\n",
      "10.0\n",
      "[[ 2.5  5. ]\n",
      " [ 5.  10. ]]\n"
     ]
    }
   ],
   "source": [
    "''' 根据定义来结算数组之间的协方差 '''\n",
    "\n",
    "def my_unbiased_covariance(arrayX, arrayY):\n",
    "    deX = my_deviation(arrayX)\n",
    "    deY = my_deviation(arrayY)\n",
    "    return sum([el1 * el2 for el1, el2 in zip(deX, deY)]) / (len(arrayX) - 1)\n",
    "\n",
    "arr1 = [1,2,3,4,5]\n",
    "arr2 = [2,4,6,8,10]\n",
    "print(my_unbiased_covariance(arr1, arr2))\n",
    "print(my_unbiased_covariance(arr2, arr1))\n",
    "print(my_unbiased_covariance(arr1, arr1))\n",
    "print(my_unbiased_covariance(arr2, arr2))\n",
    "print(np.cov([arr1, arr2], ddof=1))    # 查看协方差矩阵"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**使用矩阵运算来计算协方差矩阵**  \n",
    ">* 假设有$(M,N)$维度的矩阵$A$，每一行代表一个数组，现在需要计算各行数组之间的协方差\n",
    ">* 可以采用下列步骤：\n",
    ">  * $A$的每一行元素减去该行平均值\n",
    ">  * 计算$AA^{T}$，这相当于分别计算了每一行与包括自己在内的其它行的内积和\n",
    ">  * 将上述计算结果矩阵中的每个元素除以数组有效元素个数(也就是每行的元素个数)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[2. 4.]\n",
      " [4. 8.]]\n",
      "[[2. 4.]\n",
      " [4. 8.]]\n"
     ]
    }
   ],
   "source": [
    "''' 使用矩阵运算计算数组之间协方差 '''\n",
    "arr1 = [1,2,3,4,5]\n",
    "arr2 = [2,4,6,8,10]\n",
    "A = np.array([arr1, arr2])\n",
    "# 计算每行的平均值，但是仍然保持(2,5)的维度不变（每行中的5个元素都是该行的平均值)\n",
    "mean_matrix = np.mean(A, axis=1, keepdims=True)  # axis=1，表示分别计算每行的平均值;keepdims=True，表示保持(M,N)数组维度不变\n",
    "# A中每个元素都减去所在行的平均值\n",
    "dev_matrix = A - mean_matrix\n",
    "cov_matrix = dev_matrix.dot(dev_matrix.T)/A.shape[1]\n",
    "print(cov_matrix)\n",
    "print(np.cov(A, ddof=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">**相关性计算**  \n",
    ">* 相关性(*correlation*)：对于两个数组(元素个数相同)，计算出协方差以及两个标准差，然后用协方差除以标准差乘积\n",
    ">* 计算式：$cor(x, y)=\\dfrac{cov(x,y)}{std(x) * std(y)} $\n",
    ">* 相关性的意义\n",
    ">  * 相关性用于衡量两个数组的关联度\n",
    ">  * 1表示最强的正关联；-1表示最强的负关联；0表示完全没有关联\n",
    ">* 用于计算相关性的函数：*np.corrcoef(A)*\n",
    ">  * $A$代表一个$(M,N)$的矩阵或二维数组\n",
    ">  * 返回相关性矩阵$C$，其中$C[i,j]$表示$A[i]$与$B[j]$行的相关性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9999999999999998\n",
      "-0.9999999999999998\n",
      "-0.196977338422431\n",
      "[[ 1.          1.         -1.         -0.19697734]\n",
      " [ 1.          1.         -1.         -0.19697734]\n",
      " [-1.         -1.          1.          0.19697734]\n",
      " [-0.19697734 -0.19697734  0.19697734  1.        ]]\n"
     ]
    }
   ],
   "source": [
    "''' 计算数组之间相关性的例子 '''\n",
    "\n",
    "def my_unbiased_correlation(arrayX, arrayY):\n",
    "    deX = np.std(arrayX, ddof=1)\n",
    "    deY = np.std(arrayY, ddof=1)\n",
    "    if(deX > 0 and deY > 0):\n",
    "        return my_unbiased_covariance(arrayX, arrayY) / (deX * deY)\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "arr1 = [1,2,3,4,5]\n",
    "arr2 = [2,4,6,8,10]\n",
    "arr3 = [-2,-4,-6,-8,-10]\n",
    "arr4 = [6, -100, 3, 20, -90]\n",
    "print(my_unbiased_correlation(arr1, arr2))      # 1.0，表示最强正相关\n",
    "print(my_unbiased_correlation(arr1, arr3))      # -1.0，表示最强负相关\n",
    "print(my_unbiased_correlation(arr1, arr4))      # -0.19，表示相关性很小\n",
    "A = np.array([arr1, arr2, arr3, arr4])      # 构造4行元素的矩阵\n",
    "print(np.corrcoef(A)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **协方差、相关性、独立性之间的关系**\n",
    "对于两个数组$a$和$b$：  \n",
    "* 如果$cov(a, b)=0$，则必有$cor(a, b)=0$，即$a$和$b$没有线性关系\n",
    "* 反过来，如果$a$和$b$完全没有线性关系，则它们的协方差和相关性也为0\n",
    "* 协方差或相关性为0，并不意味着$a$和$b$完全独立，因为它们可能有非线性关联，例如：$a$基于函数$sin(x)$，$b$基于函数$cos(x)$ \n",
    "* 如果$a$和$b$是独立的，那么意味着它们既无线性关系，也无非线性关系；所以协方差和相关性都为0"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
